Abstract: We present new lower and upper bounds for the
compression rate of binary prefix codes optimized over memoryless
sources according to two related exponential codeword length
objectives. The objectives explored here are exponential-average 
length and exponential-average redundancy. The first of these 
relates to various problems involving queueing, uncertainty, and 
lossless communications, and it can be reduced to the second, 
which has properties more amenable to analysis. These bounds, 
some of which are tight, are in terms of a form of entropy 
and/or the probability of an input symbol, improving on recently 
discovered bounds of similar form. We also observe properties of 
optimal codes over the exponential-average redundancy utility. 

I. Introduction 

Among Shannon's many observations in the seminal paper 
on information theory was that, by increasing block size, the 
compression rate of a block code for a memoryless source can 
get arbitrarily close to the source entropy rate. In particular, 
given a block of Shannon entropy H bits, prefix coding 
methods such as Huffman coding can code the block with 
an expected length $L$, where $L \in [H, H + 1)$. If $p_i \in (0, 1)$
is the probability of the $i$th item, which has a codeword of
length $l_i$, then

$$L = \sum_i p_i l_i \quad \text{and} \quad H = -\sum_i p_i \lg p_i$$

where $\lg = \log_2$ and the sum is, without loss of generality,
taken over the $n$ possible items. A constant absolute difference
translates into an arbitrarily close-to-entropy compression ratio
as blocks grow in size without bound. The lower bound is
fundamental to the definition of entropy, while the upper bound
is easily seen by observing the suboptimal Shannon code. This
code, in which an event of probability $p$ is coded into a
codeword of length $\lceil -\lg p \rceil$, will always have expected length
less than $H + 1$ and never have expected length less than $L$.
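The bracketing $H \le L < H + 1$ for the Shannon code can be checked numerically. The following is a minimal sketch on an illustrative four-symbol distribution; the distribution and the helper names `shannon_lengths`, `expected_length`, and `entropy` are assumptions made for this example and do not come from the paper.

```python
from math import ceil, log2

def shannon_lengths(p):
    """Shannon code: a symbol of probability p_i gets length ceil(-lg p_i)."""
    return [ceil(-log2(pi)) for pi in p]

def expected_length(p, lengths):
    """Expected codeword length, sum_i p_i l_i."""
    return sum(pi * li for pi, li in zip(p, lengths))

def entropy(p):
    """Shannon entropy in bits, -sum_i p_i lg p_i."""
    return -sum(pi * log2(pi) for pi in p)

p = [0.4, 0.3, 0.2, 0.1]            # illustrative distribution (an assumption)
lengths = shannon_lengths(p)
H = entropy(p)
L = expected_length(p, lengths)
kraft = sum(2.0 ** -li for li in lengths)   # at most 1 for any prefix code
print(lengths, round(H, 3), round(L, 3), kraft)
```

The Kraft sum is below 1 here, which is one way to see that the Shannon code is generally suboptimal: some codewords could be shortened without violating the prefix condition.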

This unit-sized bound is preserved even for many nonlinear 
optimization criteria. Such criteria are encountered in a variety 
of lossless compression problems in which expected length is 
no longer the value to minimize. In particular, consider 



$$L_a = L_a(l, p) = \log_a \sum_{i=1}^{n} p_i a^{l_i}. \tag{1}$$



Minimizing this utility solves several problems involving compression
for queueing [1], compression with uncertainty [2],
one-shot communications [3], and unreliable communications
[4]. It is closely related to Rényi entropy

$$H_\alpha(p) = \frac{1}{1 - \alpha} \lg \sum_{i=1}^{n} p_i^\alpha \tag{2}$$

in the sense that, for $\alpha = 1/(1 + \lg a)$,

$$H_\alpha(p) \le L_a^{\mathrm{opt}}(p) < H_\alpha(p) + 1.$$

Limits define Rényi entropy for 0, 1, and $\infty$, so that

$$H_0(p) = \lim_{\alpha \downarrow 0} H_\alpha(p) = \lg \|p\|$$

(the logarithm of the number of events in $p$),

$$H_1(p) = \lim_{\alpha \to 1} H_\alpha(p) = -\sum_{i=1}^{n} p_i \lg p_i$$

(the Shannon entropy of $p$), and

$$H_\infty(p) = \lim_{\alpha \uparrow \infty} H_\alpha(p) = -\lg \max_i p_i$$

(the min-entropy). Over a constant $p$, entropy is nonincreasing
over $\alpha \ge 0$.

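$L_a$ is also closely related to exponential-average redundancy
or exponential redundancy

$$R^d(p, l) = \frac{1}{d} \lg \sum_{i=1}^{n} p_i 2^{d(l_i + \lg p_i)}.$$

If we substitute $d = \lg a$ and

$$p'_i = \frac{p_i^\alpha}{\sum_{k=1}^{n} p_k^\alpha}$$

we find

$$R^d(p', l) = \frac{1}{\lg a} \lg \sum_{i=1}^{n} p_i a^{l_i} - \frac{1}{1 - \alpha} \lg \sum_{i=1}^{n} p_i^\alpha = L_a(l, p) - H_\alpha(p). \tag{3}$$

This transformation, shown previously in [6], provides a
reduction from $L_a$ to $R^d$, allowing bounds for $R^d$ to
apply, with the addition of the entropy term, to $L_a$.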
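The identity $R^d(p', l) = L_a(l, p) - H_\alpha(p)$, with $d = \lg a$, $\alpha = 1/(1 + \lg a)$, and $p'_i \propto p_i^\alpha$, can be confirmed numerically. The sketch below assumes the standard definitions of $L_a$, Rényi entropy, and exponential redundancy; the function names, the sample distribution, and the sample codeword lengths are illustrative assumptions only.

```python
from math import log, log2

def exp_length(p, lengths, a):
    """L_a(l, p) = log_a sum_i p_i a^{l_i}."""
    return log(sum(pi * a ** li for pi, li in zip(p, lengths)), a)

def renyi_entropy(p, alpha):
    """H_alpha(p) = lg(sum_i p_i^alpha) / (1 - alpha), for alpha not in {0, 1, inf}."""
    return log2(sum(pi ** alpha for pi in p)) / (1 - alpha)

def exp_redundancy(p, lengths, d):
    """R^d(p, l) = (1/d) lg sum_i p_i 2^{d (l_i + lg p_i)}, for d != 0."""
    return log2(sum(pi * 2 ** (d * (li + log2(pi)))
                    for pi, li in zip(p, lengths))) / d

def tilt(p, alpha):
    """p'_i = p_i^alpha / sum_k p_k^alpha."""
    z = sum(pi ** alpha for pi in p)
    return [pi ** alpha / z for pi in p]

p, lengths = [0.5, 0.3, 0.2], [1, 2, 2]   # illustrative inputs (assumptions)
checks = []
for a in (2.0, 0.6):                      # one case with d > 0, one with d < 0
    d = log2(a)
    alpha = 1 / (1 + d)
    lhs = exp_redundancy(tilt(p, alpha), lengths, d)
    rhs = exp_length(p, lengths, a) - renyi_entropy(p, alpha)
    checks.append((a, lhs, rhs))
    print(a, lhs, rhs)
```

Both sides agree up to floating-point error for either sign of $d$, which is why the remainder of the analysis can work with redundancy alone.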

For both the traditional and exponential utilities, we can
improve on the unit-sized bound given the probability of one
of the source events. This was first done with the constraint
that the given probability be the most probable of these events
[7], but here, as in some subsequent work [4], [8], [9], we drop
this constraint. Without loss of generality, we call the source
symbols $\mathcal{X} = \{1, 2, \ldots, n\}$ (from most to least probable), and
call the symbol with known probability $j$; that is, $p_j$ is known,
but not necessarily $j$ itself.

In traditional linear optimization, upper and lower bounds
for $R^d$ are known such that probability distributions can be
found achieving or approaching these bounds [8], [9]; i.e., they
are tight. In the exponential cases, [4] took $a \uparrow \infty$ ($d \uparrow \infty$) and
$a \downarrow 1$ ($d \downarrow 0$), using inequality relations to find not-necessarily-tight
bounds on these problems in terms of tight bounds for
the limit cases. The goal here is to improve the bounds.

We seek to find an upper bound $\omega^d(p_j)$ and lower bound
$o^d(p_j)$ such that, for every probability distribution $p$, optimal
codeword lengths $l$ satisfy:

$$0 \le o^d(p_j) \le \min_l R^d(p, l) < \omega^d(p_j) \le 1$$

for any $j$. For such values, (3) results in:

$$o^{\lg a}\left(p_j^\alpha 2^{(\alpha - 1) H_\alpha(p)}\right) \le L_a^{\mathrm{opt}}(p) - H_\alpha(p) < \omega^{\lg a}\left(p_j^\alpha 2^{(\alpha - 1) H_\alpha(p)}\right)$$

where $\alpha = 1/(1 + \lg a)$ and $L_a^{\mathrm{opt}}(p)$ denotes the utility for
optimal lengths given $p$ and $a$. Thus we can restrict ourselves
to exponential redundancy, which is more amenable to the
analysis used here.

II. Applications 

A. $d > 0$ ($a > 1$)

Most applications of the exponential length utility concern
only $a > 1$ ($d > 0$ for the redundancy equivalent). The first
known application, introduced in Humblet's dissertation [1],
[10], is in a queueing problem originally posed by Jelinek [11].
Codewords coding a random source are temporarily stored in
a finite buffer; these are chosen such that overflow probability
is minimized.

Another application considers a source with uncertain probabilities,
one in which we only know that the relative entropy
between the actual probability mass function and $p$ is within
a known bound [2]. A third, more recent application, omitted
in the interest of brevity but described in [4], is a modified
case of the application in the next paragraph.

B. $d < 0$ ($a < 1$)

An application for $a < 1$ involves single-shot communications
with a communication channel having a window of
opportunity of geometrically-distributed length (in bits) [3]. If
the distribution has parameter $a$, the probability of successful
transmission is

$$P[\text{success}] = a^{L_a(p, l)} = \sum_{i=1}^{n} p_i a^{l_i}.$$

Maximizing this is equivalent to minimizing (1). The solution
is trivial for $a \le 0.5$ ($d \le -1$), a case not covered by Rényi
entropy, and thus not applicable here.



III. Bounds 

The variation of the Huffman algorithm which finds an optimal
code for exponential redundancy differs as follows: While
Huffman coding inductively pairs the two lowest probabilities
(weights) $w_x$ and $w_y$, combining them into an item weighted
$f(w_x, w_y) = w_x + w_y$, optimizing exponential redundancy
requires the combined item to be weight

$$f^d(w_x, w_y) \triangleq \left(2^d w_x^{1+d} + 2^d w_y^{1+d}\right)^{\frac{1}{1+d}}. \tag{4}$$

The optimality of this is shown in [12] and can be illustrated with
an exchange argument (e.g., [13, pp. 124-125] for the linear
case). An exchange argument also inductively illustrates that
such an algorithm, depending on how ties are broken, can
achieve any optimal set of codeword lengths: Clearly the only
optimal code is obtained for $n = 2$. Let $n'$ be the smallest $n$
for which there is a set of $\{l_i\}$ that is optimal but cannot be
obtained via the algorithm. Since $\{l_i\}$ is optimal, consider the
two smallest probabilities, $p_{n'}$ and $p_{n'-1}$. In this optimal code,
two items having these probabilities (although not necessarily
items $n' - 1$ and $n'$) must have the longest codewords and
must have the same codeword lengths. Otherwise, we could
exchange the codeword with a longer codeword corresponding
to a more probable item and improve the utility function,
showing nonoptimality. Merge these two items into one with
probability $f^d(p_{n'}, p_{n'-1})$, as per the algorithm. Because of
the nature of $f^d$, this is a reduced problem, i.e., an equivalent
optimization to the original problem. This means that there is
a set of lengths optimal for this problem such that all non-merged
items are identical to the corresponding $l_i$, while the
merged item is simply one shorter than the longest $l_i$. Since we
inductively assumed all optimal length sets could be produced
for $n' - 1$, the assumption is verified for all $n$.
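The merging procedure described above can be sketched with a priority queue. This is a minimal illustration, assuming the combining rule $f^d(w_x, w_y) = (2^d w_x^{1+d} + 2^d w_y^{1+d})^{1/(1+d)}$ and the redundancy $R^d(p, l) = \frac{1}{d} \lg \sum_i p_i^{1+d} 2^{d l_i}$; the function names, the tie-breaking counter, and the sample distribution are hypothetical choices made for this example, not the author's implementation.

```python
import heapq
from itertools import count
from math import log2

def merge_weight(wx, wy, d):
    """Combined weight f^d(w_x, w_y); reduces to w_x + w_y as d -> 0."""
    return (2 ** d * wx ** (1 + d) + 2 ** d * wy ** (1 + d)) ** (1 / (1 + d))

def exp_redundancy(p, lengths, d):
    """R^d(p, l) = (1/d) lg sum_i p_i^(1+d) 2^(d l_i), for d != 0."""
    return log2(sum(pi ** (1 + d) * 2 ** (d * li)
                    for pi, li in zip(p, lengths))) / d

def huffman_like(p, d):
    """Return (codeword lengths, root weight) for the Huffman-like procedure."""
    tie = count()                       # breaks ties so item lists are never compared
    heap = [(w, next(tie), [i]) for i, w in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    while len(heap) > 1:
        wx, _, sx = heapq.heappop(heap)     # two smallest weights
        wy, _, sy = heapq.heappop(heap)
        for i in sx + sy:                   # every leaf below the merge gets one bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (merge_weight(wx, wy, d), next(tie), sx + sy))
    return lengths, heap[0][0]

p = [0.4, 0.3, 0.2, 0.1]                # illustrative distribution (an assumption)
for d in (1.0, -0.5):
    lengths, w_root = huffman_like(p, d)
    print(d, lengths, exp_redundancy(p, lengths, d), w_root)
```

Different tie-breaking orders can yield different but equally good length sets, in line with the exchange argument in the text; the counter above simply fixes one such order.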

Related observations form the following theorem, similar to 
that in [4] for a non-exponential utility: 

Theorem 1: Suppose we apply (4) to find a Huffman-like
code tree in order to minimize exponential redundancy
$R^d(p, l)$ for $d > -1$. Then the following holds for any
optimal $l$:

1) For $d > 0$, items are always merged by nondecreasing
weight and the total probability of any subtree is no
greater than the weight of the (root of the) subtree. For
$d < 0$, the total probability of any subtree is no less than
the weight of the subtree.

2) The weight of the root of the coding tree is $w_{\mathrm{root}} = 2^{d R^d(p, l)/(1+d)}$.

3) If $p_1 \le f^d(p_{n-1}, p_n)$, then an optimal code can be
represented by a complete tree, that is, a tree with leaves
at depth $\lfloor \lg n \rfloor$ and $\lceil \lg n \rceil$ only (with $\sum_i 2^{-l_i} = 1$).

Proof: Again we use induction, this time using trivial base
cases of sizes 1 and 2, and assuming the propositions true for
sizes $n - 1$ and smaller. We assume without loss of generality
that, for size $n$, items $n - 1$ and $n$ are the first to be merged.
We use weight terminology ($w$) instead of probabilities ($p$)
because reduced problems need not have weights sum to 1.

The subtree part of the first property considers subtrees of
size $n$, not necessarily the whole coding tree. All we need
to have a successful reduction to size $n - 1$ is to show the
following:

$$f^d(w_x, w_y) = \left(2^d w_x^{1+d} + 2^d w_y^{1+d}\right)^{\frac{1}{1+d}} \tag{5}$$

$$\ge w_x + w_y \tag{6}$$

for $d > 0$, and

$$f^d(w_x, w_y) \le w_x + w_y \tag{7}$$

for $d \in (-1, 0)$, with equality in either case if and only if
$w_x = w_y$. The inequalities are due to the identical property of
the generalized mean in [14, 3.2.4]:

$$M(t) = \left(\frac{1}{m} \sum_{k=1}^{m} a_k^t\right)^{\frac{1}{t}}$$

with, in this case, $m = 2$, $a_1 = 2w_x$, $a_2 = 2w_y$, and $t$ as
$1 + d$ in (5) (left-hand side of (7)) and 1 in (6) (right-hand
side of (7)).
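As a worked restatement of this substitution (no new result is claimed), setting $m = 2$, $a_1 = 2w_x$, and $a_2 = 2w_y$ in the generalized mean gives

$$M(1 + d) = \left(\frac{(2w_x)^{1+d} + (2w_y)^{1+d}}{2}\right)^{\frac{1}{1+d}} = \left(2^d w_x^{1+d} + 2^d w_y^{1+d}\right)^{\frac{1}{1+d}} = f^d(w_x, w_y), \qquad M(1) = \frac{2w_x + 2w_y}{2} = w_x + w_y.$$

Because $M(t)$ is nondecreasing in $t$, with equality only when all $a_k$ are equal, $1 + d > 1$ yields $f^d(w_x, w_y) \ge w_x + w_y$ for $d > 0$, and $1 + d < 1$ yields the reverse inequality for $d \in (-1, 0)$.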

It immediately follows in the $d > 0$ case that $f^d(w_x, w_y) > w_x$.
Thus, the first two weights of the entire tree merge to form
a weight no less than either original weight, and all remaining
weights are also no less than those two weights. Call the
resulting lengths $l'$.

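To prove the second property, note that, after merging the
aforementioned two least weighted items, we have $n - 1$
weights, and thus a conforming reduced problem. Call the
combined weight $w'_c$. Then, applying the inductive hypothesis
to the reduced problem in the first equality,

$$\begin{aligned}
w_{\mathrm{root}}^{1+d} &= (w'_c)^{1+d} 2^{(l_n - 1) d} + \sum_{i=1}^{n-2} p_i^{1+d} 2^{l_i d} \\
&= 2^d \left(p_{n-1}^{1+d} + p_n^{1+d}\right) 2^{(l_n - 1) d} + \sum_{i=1}^{n-2} p_i^{1+d} 2^{l_i d} \\
&= \sum_{i=1}^{n} p_i^{1+d} 2^{l_i d} \\
&= 2^{d R^d(p, l)}
\end{aligned}$$

where the second equality is due to (4) and the third to $l_{n-1} = l_n$.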

The third property is shown via the operation of the algorithm
from start to finish: First note that $\sum_i 2^{-l_i} = 1$ for
any tree created using the Huffman-like procedure, since all
internal nodes have two children. Now think of the procedure
as starting with a priority queue of input items, ordered by
nondecreasing weight from head to tail. After merging two
items, obtained from the head, into one compound item, that
item is placed back into the queue. Since we are using a
priority queue, the merged item is placed such that its weight
is no smaller than any item ahead of it and is smaller than any
item behind it.

In keeping items ordered, we obtain an optimal coding
tree. A first derivative test shows that $f^d$ is nondecreasing
on both inputs for any $d$. Thus merged items are created in
nondecreasing weight. If $p_1 \le f^d(p_{n-1}, p_n)$, the first merged
item can be inserted to the tail of the queue; since merged
items are created in nondecreasing weight, subsequent items
are as well. This is a sufficient condition for a complete tree
being optimal [3, Lemma 2]. ■
Next is our main result: 

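Theorem 2: Suppose we know $d > -1$ ($d \ne 0$) and one
$p_j$ of probability mass function $p$ for which we want to find
the optimal code $l$ under exponential redundancy. Consider
functions

$$\omega^d(p_j) = \min_{\lambda \in \{1, 2, \ldots\}} \frac{1}{d} \lg \left(p_j^{1+d} 2^{d\lambda} + (1 - p_j)^{1+d} \left(\frac{1 - 2^{-\lambda}}{2}\right)^{-d}\right) \tag{8}$$

making transitions between $\lambda$ and $\lambda + 1$ at

$$p_\lambda = \left(1 + \left(\frac{1 - 2^{-d}}{(2^\lambda - 1)^{-d} - (2^\lambda - 0.5)^{-d}}\right)^{\frac{1}{1+d}}\right)^{-1}$$

and

$$o^d(p_j) = \min_{\mu \in \{1, 2, \ldots\}} \left(\mu + \frac{1}{d} \lg \left(p_j^{1+d} + (1 - p_j)^{1+d} (2^\mu - 1)^{-d}\right)\right)$$

with transitions between $\mu$ and $\mu + 1$ at

$$p_\mu = \left(1 + \left(\frac{2^d - 1}{(2^\mu - 1)^{-d} - (2^\mu - 0.5)^{-d}}\right)^{\frac{1}{1+d}}\right)^{-1}.$$

These improve bounds on the optimal code, and the upper
bound is a strict inequality, in that

$$0 \le o^d(p_j) \le R^d(p, l) < \omega^d(p_j) \le 1.$$

Moreover, the lower bounds are achievable given $p_1$ and the
upper bounds are approachable given $p_1 > 0.5$. In addition,
for $p_j < 0.5$ and $d < 0$, we have the following secondary
upper bound:

$$R^d(p, l) < \max\left[0.5, \frac{1}{d} \lg \left(p_j^{1+d} 4^d + (1 - p_j)^{1+d} 2^d\right)\right]. \tag{9}$$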
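The two bounding functions of the theorem are straightforward to evaluate. The sketch below assumes the reading $\omega^d(p_j) = \min_\lambda \frac{1}{d} \lg\big(p_j^{1+d} 2^{d\lambda} + (1 - p_j)^{1+d} ((1 - 2^{-\lambda})/2)^{-d}\big)$ and $o^d(p_j) = \min_\mu \big(\mu + \frac{1}{d} \lg(p_j^{1+d} + (1 - p_j)^{1+d} (2^\mu - 1)^{-d})\big)$ over positive integers; the names `omega` and `o`, the search cap `max_len`, and the evaluation grid are assumptions made for illustration.

```python
from math import log2

def omega(pj, d, max_len=64):
    """Upper bound: best value over candidate lengths lambda = 1..max_len (d != 0)."""
    return min(
        log2(pj ** (1 + d) * 2 ** (d * lam)
             + (1 - pj) ** (1 + d) * ((1 - 2.0 ** -lam) / 2) ** (-d)) / d
        for lam in range(1, max_len + 1)
    )

def o(pj, d, max_len=64):
    """Lower bound: best value over candidate lengths mu = 1..max_len (d != 0)."""
    return min(
        mu + log2(pj ** (1 + d)
                  + (1 - pj) ** (1 + d) * (2.0 ** mu - 1) ** (-d)) / d
        for mu in range(1, max_len + 1)
    )

# Tabulate both bounds on a coarse grid of p_j for a few values of d.
grid = [k / 20 for k in range(1, 20)]
table = {d: [(pj, o(pj, d), omega(pj, d)) for pj in grid]
         for d in (-0.5, 0.5, 1.0, 2.0)}
for d, rows in table.items():
    print(d, [(pj, round(lo, 4), round(hi, 4)) for pj, lo, hi in rows[:3]])
```

Capping the search at a finite `max_len` is a practical shortcut: for the grid used here the minimizing length is small, but a smaller $p_j$ would need a larger cap.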
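Proof:

1) Lower bound: The lower bound calculation is:

$$\begin{aligned}
R^d(p, l) &= \frac{1}{d} \lg \sum_{i \in \mathcal{X}} p_i^{1+d} 2^{d l_i} \\
&= \frac{1}{d} \lg \left(p_j^{1+d} 2^{d l_j} + (1 - p_j)^{1+d} 2^{d l_n} \sum_{i \in \mathcal{X} \setminus \{j\}} 2^{l_n - l_i} \left(\frac{p_i 2^{l_i - l_n}}{1 - p_j}\right)^{1+d}\right) \\
&\overset{(a)}{=} \frac{1}{d} \lg \left(p_j^{1+d} 2^{d l_j} + (1 - p_j)^{1+d} 2^{d l_n} \sum_{i \in \mathcal{X} \setminus \{j\}} \sum_{k=1}^{2^{l_n - l_i}} \left(\frac{p_i 2^{l_i - l_n}}{1 - p_j}\right)^{1+d}\right) \\
&\overset{(b)}{\ge} \frac{1}{d} \lg \left(p_j^{1+d} 2^{d l_j} + (1 - p_j)^{1+d} 2^{d l_n} \left(2^{l_n} - 2^{l_n - l_j}\right)^{-d}\right) \\
&= l_j + \frac{1}{d} \lg \left(p_j^{1+d} + (1 - p_j)^{1+d} \left(2^{l_j} - 1\right)^{-d}\right).
\end{aligned}$$

The first equality is due to the definition, while the other
equalities follow from algebra. The summation following (a)
is a sum of the $(1 + d)$ power of $2^{l_n} - 2^{l_n - l_j}$ positive terms
which sum to 1. Consider these values, which include $2^{l_n - l_i}$
repetitions of each $p_i 2^{l_i - l_n} / (1 - p_j)$ for $i \ne j$, as a probability
distribution called $q$. Then the summation is related to the
$(1 + d)$-Rényi entropy of $q$; substituting using its definition
(2) leads to (10) below. Furthermore, because $H_0(q) = \lg \|q\|$
and $H_\alpha$ is nonincreasing with $\alpha$, (10) is bounded as follows:

$$\left(\sum_{k=1}^{\|q\|} q_k^{1+d}\right)^{\frac{1}{d}} = 2^{-H_{1+d}(q)} \tag{10}$$

$$\ge \frac{1}{\|q\|} = \left(2^{l_n} - 2^{l_n - l_j}\right)^{-1}.$$

This results in inequality (b), completing the lower bound by
substituting minimizing $\mu$ for $l_j$. The transitions follow from
algebraically finding where there are two minimizing values.
A distribution whose optimal code achieves this lower bound, for
$p_1 = p_j \in [1/(2^{\mu+1} - 1), 1/2^\mu)$ for some $\mu$, is

$$p = \left(p_1, \underbrace{\frac{1 - p_1}{2^{\mu+1} - 2}, \ldots, \frac{1 - p_1}{2^{\mu+1} - 2}}_{2^{\mu+1} - 2}\right).$$

By Theorem 1 this has a complete coding tree (recall
$f^d(w_x, w_x) = 2w_x$), in this case with $l_1$ one bit shorter
than the other lengths. This is easily calculated as achieving
the lower bound.

[Figure 1: two panels, (a) Upper bounds and (b) Lower bounds, each plotted against $p_j$.]

Fig. 1. Bounds on optimal $R^d_{\mathrm{opt}}(p)$ given $p_j$ over various $d$ (see legends). The thick (dash-dotted) lines correspond to the usual linear redundancy utility
($d \to 0$), while the uppermost (solid) lines are minimum maximum pointwise redundancy ($d \to \infty$). Lower bounds are tight over all $d > -1$, while upper
bounds are only tight for minimum maximum pointwise redundancy, for $p_j > 0.5$ if $d \in (-1, \infty)$, and for $(0, \pi_0)$ if $d \in (-1, 0)$, where $\pi_0$ is the first
root of the equality of the two terms in the maximization at (9). The tight upper bounds for $d < \infty$ are approached by $p = (p_j, 1 - p_j - \epsilon, \epsilon)$.

2) Upper bounds: Consider the following code for an
arbitrary $\lambda$, as in (8):

$$l_i(p) = \begin{cases} \lambda, & i = j \\ \left\lceil -\lg \left(p_i \dfrac{1 - 2^{-\lambda}}{1 - p_j}\right) \right\rceil, & i \ne j. \end{cases}$$

Satisfying the Kraft inequality, it is a valid, possibly
suboptimal, code, and thus has a utility that upper-bounds
that of the optimal code. Thus:

$$\begin{aligned}
R^d(p, l) &= \frac{1}{d} \lg \sum_{i \in \mathcal{X}} p_i^{1+d} 2^{d l_i} \\
&= \frac{1}{d} \lg \left(p_j^{1+d} 2^{d\lambda} + \sum_{i \in \mathcal{X} \setminus \{j\}} p_i^{1+d} 2^{d \left\lceil -\lg \left(p_i (1 - 2^{-\lambda})/(1 - p_j)\right) \right\rceil}\right) \\
&< \frac{1}{d} \lg \left(p_j^{1+d} 2^{d\lambda} + \sum_{i \in \mathcal{X} \setminus \{j\}} p_i^{1+d} \left(\frac{p_i (1 - 2^{-\lambda})}{2 (1 - p_j)}\right)^{-d}\right) \\
&= \frac{1}{d} \lg \left[p_j^{1+d} 2^{d\lambda} + (1 - p_j)^{1+d} \left(\frac{1 - 2^{-\lambda}}{2}\right)^{-d}\right].
\end{aligned}$$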



Since $\lambda$ is arbitrary, the bound is obtained by choosing
the value offering the strictest bound. This upper bound is
approached for any $d > -1$ over $p_1 = p_j \in (0.5, 1)$ for
$p = (p_j, 1 - p_j - \epsilon, \epsilon)$ (i.e., $j = 1$ and $\lambda = 1$).
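The approach to the upper bound can be observed numerically. The sketch below assumes that, for $p = (p_j, 1 - p_j - \epsilon, \epsilon)$ with $p_j > 0.5$, the optimal lengths are $(1, 2, 2)$ and that the $\lambda = 1$ bound is $\frac{1}{d} \lg\big(p_j^{1+d} 2^d + (1 - p_j)^{1+d} 4^d\big)$; the function names and the choice $p_j = 0.7$ are illustrative assumptions.

```python
from math import log2

def exp_redundancy(p, lengths, d):
    """R^d(p, l) = (1/d) lg sum_i p_i^(1+d) 2^(d l_i), for d != 0."""
    return log2(sum(pi ** (1 + d) * 2 ** (d * li)
                    for pi, li in zip(p, lengths))) / d

def omega_lambda1(pj, d):
    """Upper-bound expression evaluated at lambda = 1."""
    return log2(pj ** (1 + d) * 2 ** d + (1 - pj) ** (1 + d) * 4 ** d) / d

pj = 0.7                                   # any value in (0.5, 1) would do
gaps = {}
for d in (1.0, -0.5):
    gaps[d] = [
        omega_lambda1(pj, d)
        - exp_redundancy([pj, 1 - pj - eps, eps], [1, 2, 2], d)
        for eps in (1e-2, 1e-4, 1e-6)
    ]
    print(d, gaps[d])                      # gaps shrink toward 0 as eps shrinks
```

The gap stays strictly positive for every $\epsilon > 0$, which matches the bound being approachable rather than achievable.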

Now consider $d < 0$ and $p_j < 0.5$. As noted in [4], an
application of Lyapunov's inequality for moments [17, p. 27]
yields $R^{d'}(p, l) \le R^d(p, l)$ for $d' \le d$, and, in particular,
$R^d(p, l) \le R^0(p, l)$ in this case, where

$$R^0(p, l) = \sum_{i \in \mathcal{X}} p_i l_i - H_1(p)$$

via limits. Since this is true for all values, it is true over the
minimization, and bounds for the usual linear case apply here.



In particular, as found in [18] and noted in [9], if we define

$$f(p_1) = \begin{cases} 3 - 5p_1 - H_1(2p_1), & \pi_1 \le p_1 < 0.5 \\ 2 - \lg 3, & 0 < p_1 < \pi_1 \end{cases} \tag{11}$$

where $H_1(2p_1)$ is the entropy of the binary distribution
$(2p_1, 1 - 2p_1)$ and $\pi_1 \approx 0.491$ is the root of the equality of the two terms,
then this serves as an upper bound (given most probable $p_1$) on
optimal redundancy (linear, and thus also $d < 0$) in $(0, 0.5)$.

Since this never exceeds the bound we seek here, we can
now consider only $p_j < p_1$. Consider first those cases in which
(9) is greater than 0.5. In these cases, we use the fact that
$p_1 \in [p_j, 1 - p_j)$ to note that the maximum upper bound over
this range, using (8) and (11), is $\omega^d(p_1)$ at $p_1 = 1 - p_j$,
thus supplying the upper bound for the range $(0, \pi_0)$, where
$\pi_0$ is the first root of the equality of the two terms in the
maximization at (9).

Over $p_j \in (\pi_0, 0.5)$, we first note that 0.5 is an upper bound
via similar logic: If $p_1 < 0.5$, we already know that this is
an upper bound. Otherwise $p_1 \in (0.5, 1 - \pi_0)$, and (8) using
$j = 1$ provides an upper bound not exceeding 0.5. ■

Fig. 1 illustrates these bounds at a handful of values, and
at limits $-1$, 0, and $\infty$. For $d \to 0$, l'Hôpital's rule reveals
the lower bound to be the optimal one of Theorem 2 of
[15] for $j = 1$ and Theorem 4 of [9] for arbitrary $j$. If one
replaces optimal $\lambda$ with (possibly suboptimal) $\lceil -\lg p_j \rceil$, the
upper bound becomes the suboptimal one of Lemma 1 of [8].
Taking $d \to \infty$ using, for any positive $x$, $y$, $a$, $b$,

$$\lim_{d \to \infty} \frac{1}{d} \lg (x a^d + y b^d) = \lg \max(a, b)$$

yields the optimal bounds of [4], which are both tight.
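The limit used above can be illustrated numerically. This is a small sketch with arbitrarily chosen positive constants (assumptions for illustration); it factors out $\max(a, b)^d$ before taking the logarithm so that large $d$ does not overflow.

```python
from math import log2

def scaled_log(x, a, y, b, d):
    """(1/d) lg(x a^d + y b^d), computed stably by factoring out max(a, b)^d."""
    m = max(a, b)
    return log2(m) + log2(x * (a / m) ** d + y * (b / m) ** d) / d

x, a, y, b = 0.2, 3.0, 5.0, 2.0            # arbitrary positive constants
for d in (1, 10, 100, 1000):
    print(d, scaled_log(x, a, y, b, d), log2(max(a, b)))
```

The coefficients $x$ and $y$ only contribute a term of order $1/d$, which is why they vanish from the limit.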

The upper bound is clearly not optimal here, since it is
not optimal for $d \to 0$ from either direction. However, the
following fact might be of help in improving this in future
work:

Theorem 3: If $d < 0$ and $p_1 \ge 0.4$, an optimal code exists
with $l_1 = 1$.

Proof: The approach here is similar to [16]. Consider the
coding step at which item 1 gets combined with other items;
we wish to prove that this is the last step. At the beginning
of this step the (possibly merged) items left to combine are
$\{1\}, S_2^k, S_3^k, \ldots, S_k^k$, where we use $S_j^k$ to denote the set of
(individual) items combined into a (possibly) compound item,
and $w(S_j^k)$ to denote its weight. At this step, $p_1$ is smaller
than all but possibly one of $w(S_j^k)$, so $(k - 1)p_1 \ge (k - 1)0.4$
is less than the sum of weights, which in turn is less than or
equal to 1. Thus $k$ is at most three.

Consider items $\{1\}$, $S_2^3$, and $S_3^3$. Assume without loss of
generality that $w(S_2^3) \ge w(S_3^3)$. If $S_2^3$ is not compound,
$\{1\}$ has the greatest weight and we are finished. If it is
compound, consider its two subtrees. Due to the
combination order, neither subtree has weight exceeding $w(S_3^3)$,
so, by (7), $w(S_2^3) \le 2w(S_3^3)$. Thus $1.5w(S_2^3) \le
w(S_2^3) + w(S_3^3) \le 0.6$, so $w(S_3^3) \le w(S_2^3) \le 0.4$, and we can
combine these two items to achieve the optimal code. This is
tight in the sense that $(p_1, (1 - p_1)/3, (1 - p_1)/3, (1 - p_1)/3)$
has $l_1 = 2$ for $p_1 \in (0.25, 0.4)$. ■



As an example of the improvement these bounds offer, we
revisit the examples of [4], which consider minimizing $L_a$
over Benford's distribution [19], [20]:

$$p_i = \log_{10}(i + 1) - \log_{10}(i), \quad i = 1, 2, \ldots, 9$$

for $a = 0.6$ and $a = 2$ given $p_1$. The bounds of [4] show that
optimal $L_{0.6}$ for such a $p_1$ must lie in $[2.372\ldots, 2.707\ldots)$.
This is identical to the application of the current result, which
should not surprise, as the prior bounds apply and are tight in
cases where we can show, as in this case, that $l_1 = 1$.
A more interesting case is that of $a = 2$, for which the prior
bounds, $[3.039\ldots, 3.910\ldots]$, are superseded by the tighter
$[3.051\ldots, 3.863\ldots)$; optimal $L_2 = 3.099\ldots$.
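The $a = 2$ figures can be reproduced with a short script. The sketch below is an illustration under stated assumptions, not the author's code: it maps $p$ to $p'_i \propto p_i^\alpha$ with $\alpha = 1/(1 + \lg a)$, runs the Huffman-like merge with $f^d$ on $p'$, and evaluates the bound functions as read from Theorem 2. All function and variable names are hypothetical.

```python
import heapq
from itertools import count
from math import log, log2, log10

def huffman_like(q, d):
    """Codeword lengths from repeatedly merging the two smallest weights with f^d."""
    tie = count()
    heap = [(w, next(tie), [i]) for i, w in enumerate(q)]
    heapq.heapify(heap)
    lengths = [0] * len(q)
    while len(heap) > 1:
        wx, _, sx = heapq.heappop(heap)
        wy, _, sy = heapq.heappop(heap)
        for i in sx + sy:
            lengths[i] += 1
        merged = (2 ** d * (wx ** (1 + d) + wy ** (1 + d))) ** (1 / (1 + d))
        heapq.heappush(heap, (merged, next(tie), sx + sy))
    return lengths

def omega(pj, d, max_len=64):
    """Upper bound on redundancy given p_j (assumed form of the theorem's function)."""
    return min(log2(pj ** (1 + d) * 2 ** (d * lam)
                    + (1 - pj) ** (1 + d) * ((1 - 2.0 ** -lam) / 2) ** (-d)) / d
               for lam in range(1, max_len + 1))

def o(pj, d, max_len=64):
    """Lower bound on redundancy given p_j (assumed form of the theorem's function)."""
    return min(mu + log2(pj ** (1 + d)
                         + (1 - pj) ** (1 + d) * (2.0 ** mu - 1) ** (-d)) / d
               for mu in range(1, max_len + 1))

p = [log10(i + 1) - log10(i) for i in range(1, 10)]   # Benford's distribution
a = 2.0
d = log2(a)
alpha = 1 / (1 + d)
z = sum(pi ** alpha for pi in p)
q = [pi ** alpha / z for pi in p]                     # transformed distribution p'
H = log2(z) / (1 - alpha)                             # Renyi entropy H_alpha(p)
lengths = huffman_like(q, d)
L_opt = log(sum(pi * a ** li for pi, li in zip(p, lengths)), a)
lower = H + o(q[0], d)
upper = H + omega(q[0], d)
print(lengths, L_opt, lower, upper)
```

Under these assumptions the script yields an optimal $L_2$ of about 3.099 inside an interval of roughly $[3.051, 3.864)$, consistent with the values quoted in the text.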



References 



[1] P. A. Humblet, "Generalization of Huffman coding to minimize the
probability of buffer overflow," IEEE Trans. Inf. Theory, vol. IT-27,
no. 2, pp. 230-232, Mar. 1981.
[2] F. Rezaei and C. D. Charalambous, "Robust coding for uncertain
sources: A minimax approach," in Proc., 2005 IEEE Int. Symp. on
Information Theory, Sept. 4-9, 2005, pp. 1539-1543.
[3] M. B. Baer, "Optimal prefix codes for infinite alphabets with nonlinear
costs," IEEE Trans. Inf. Theory, vol. IT-54, no. 3, pp. 1273-1286, Mar.
2008.
[4] M. B. Baer, "Redundancy-related bounds for generalized Huffman codes,"
IEEE Trans. Inf. Theory, vol. IT-57, no. 4, Apr. 2011, to appear; available
from http://arxiv.org/abs/cs.IT/0702059
[5] A. Rényi, "Some fundamental questions of information theory," Magyar
Tudományos Akadémia III. Osztályának Közleményei, vol. 10, no. 1, pp.
251-282, 1960.
[6] A. C. Blumer and R. J. McEliece, "The Rényi redundancy of generalized
Huffman codes," IEEE Trans. Inf. Theory, vol. IT-34, no. 5, pp. 1242-1249,
Sept. 1988.
[7] R. G. Gallager, "Variations on a theme by Huffman," IEEE Trans. Inf.
Theory, vol. IT-24, no. 6, pp. 668-674, Nov. 1978.
[8] C. Ye and R. W. Yeung, "A simple bound of the redundancy of Huffman
codes," IEEE Trans. Inf. Theory, vol. IT-48, no. 7, pp. 2132-2138, July
2002.
[9] S. Mohajer, S. Pakzad, and A. Kakhbod, "Tight bounds on the redundancy
of Huffman codes," in Proc., IEEE Information Theory Workshop,
Mar. 13-17, 2006, pp. 131-135.
[10] P. A. Humblet, "Source coding for communication concentrators," Ph.D.
dissertation, Massachusetts Institute of Technology, 1978.
[11] F. Jelinek, "Buffer overflow in variable length coding of fixed rate
sources," IEEE Trans. Inf. Theory, vol. IT-14, no. 3, pp. 490-501, May
1968.
[12] D. S. Parker, Jr., "Conditions for optimality of the Huffman algorithm,"
SIAM J. Comput., vol. 9, no. 3, pp. 470-489, Aug. 1980.
[13] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. New
York, NY: Wiley-Interscience, 2006.
[14] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions
with Formulas, Graphs, and Mathematical Tables. Mineola, NY: Dover
Publications, 1964.
[15] B. L. Montgomery and J. Abrahams, "On the redundancy of optimal
binary prefix-condition codes for finite and infinite sources," IEEE Trans.
Inf. Theory, vol. IT-33, no. 1, pp. 156-160, Jan. 1987.
[16] O. Johnsen, "On the redundancy of binary Huffman codes," IEEE Trans.
Inf. Theory, vol. IT-26, no. 2, pp. 220-222, Mar. 1980.
[17] G. H. Hardy, J. E. Littlewood, and G. Pólya, Inequalities. Cambridge,
UK: Cambridge Univ. Press, 1934.
[18] D. Manstetten, "Tight bounds on the redundancy of Huffman codes,"
IEEE Trans. Inf. Theory, vol. IT-37, no. 1, pp. 144-151, Jan. 1992.
[19] S. Newcomb, "Note on the frequency of use of the different digits in
natural numbers," Amer. J. Math., vol. 4, no. 1/4, pp. 39-40, 1881.
[20] F. Benford, "The law of anomalous numbers," Proc. Amer. Phil. Soc.,
vol. 78, no. 4, pp. 551-572, Mar. 1938.



